查看原文
其他

你以为的可能不是你以为的

jimmy 生信技能树 2022-06-07

最近生信技能树管理员小朋友XZG跟我炫耀他植物的简化基因组的gvcf模式,两百个测序数据,我一直没用过这个gvcf功能,因为的确没有需求。癌症研究,关注的主要是somatic mutation。

而且一直以为gvcf是把所有位点都report出来,不管那个位点有没有突变与否,只需要测序覆盖到了!!所以输出的文件行数会非常夸张!!!因为正常人的germline mutation在千分之一左右,所以gvcf得到的vcf会暴涨到1000倍行数。

但是简单的测试了一个坐标区间,发现事实可能不是我想象的那么简单!

$GATK  HaplotypeCaller  -ERC GVCF  -L $bed -R $GENOME -I $bam  --dbsnp $DBSNP -O tmp.vcf

我使用的gatk版本信息如下:

Using GATK jar /home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar
Running:
   java -Dsamjdk.use_async_io_read_samtools=false
   -Dsamjdk.use_async_io_write_samtools=true
   -Dsamjdk.use_async_io_write_tribble=false
   -Dsamjdk.compression_level=2
   -Xmx20G -Djava.io.tmpdir=./
   -jar /home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar HaplotypeCaller
USAGE: HaplotypeCaller [arguments]
Call germline SNPs and indels via local re-assembly of haplotypes
Version:4.0.3.0

我测试的这个坐标区别是 1217 bp ,在gatk的运行日志也可以看到

IntervalArgumentCollection - Processing 1217 bp from intervals

2249 read(s) filtered by: MappingQualityReadFilter

那么,按照道理,出来的vcf应该是1217行,结果:

grep -v "^#" tmp.vcf|wc
   119    1190    9875

那么,为什么会缺失这么多呢?我首先想到的是我给的坐标区间大部分并没有被这个bam文件所覆盖,也就是说,没有测序到。

所以我用下面的命令检验:

samtools mpileup  -r chr1:68940-70157 $bam |wc
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
  1218    7308  581882

发现我给的坐标区间都被覆盖了,然后我想到了可能是很多reads测序质量太差被过滤了,所以我看了看质量

samtools view $bam chr1:68940-70157|wc
  2896   57920 1475460
samtools view $bam -q 20  chr1:68940-70157|wc
   651   13020  334253

看起来,测序质量高的reads不多,因为我测试坐标的特殊性。然后我看了看gvcf文件的前五列:

grep -v "^#" tmp.vcf|cut -f 1-5|head
chr1    68941   .   A   <NON_REF>
chr1    69072   .   G   <NON_REF>
chr1    69073   .   A   <NON_REF>
chr1    69083   .   A   <NON_REF>
chr1    69090   .   T   <NON_REF>
chr1    69101   .   A   <NON_REF>
chr1    69130   .   C   <NON_REF>
chr1    69159   .   G   <NON_REF>
chr1    69184   .   G   <NON_REF>
chr1    69185   .   T   <NON_REF>

有些坐标跳跃了,也有些是连续的,再跟mpileup对应一下,如下:

samtools mpileup  -r chr1:68940-70157 $bam |cut -f 1,2,4,5|head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    68940   35  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
chr1    68941   36  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68942   36  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
chr1    68943   36  A$AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68944   36  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G
chr1    68945   37  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68946   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68947   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68948   40  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G^!G
chr1    68949   41  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T

实在是太奇怪了,那些被跳跃的位点,的确有测序深度信息呀!

而且,虽然 -q 20 -Q 20 这样的reads过滤可以大幅度丢失reads,但其实并不会丢失坐标信息:

[jianmingzeng@jade germline]$ samtools mpileup -q 20 -Q 20  -r chr1:68940-70157 $bam |head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    69072   N   1   ^:G ?
chr1    69073   N   2   A^:A    C?
chr1    69074   N   2   GG  >>
chr1    69075   N   2   TT  AA
chr1    69076   N   2   GG  ??
chr1    69077   N   2   AA  AC
chr1    69078   N   2   AA  BB
chr1    69079   N   2   AA  CB
chr1    69080   N   2   CC  AA
chr1    69081   N   2   GG  99
[jianmingzeng@jade germline]$ samtools mpileup  -r chr1:68940-70157 $bam |head |cut -f 1-5
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1    68940   N   35  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
chr1    68941   N   36  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68942   N   36  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
chr1    68943   N   36  A$AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68944   N   36  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G
chr1    68945   N   37  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68946   N   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1    68947   N   38  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1    68948   N   40  GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G^!G
chr1    68949   N   41  TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!Tl

不过还没等我仔细分析个中缘由,XZG就直接甩给了我一个链接,里面很清楚写到

With GVCF, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band.

block挺好的呀,节省空间,要真的是bp分辨率记录,那么文件会大的可怕! 不过没有突变的地方既然被折叠了,也就失去了该位点的测序深度信息了。

不过仔细想想,GATK团队应该不至于没考虑到这一点,可能是有着同样测序深度的地方才会被折叠!问题在于我选取了前五列查看信息,其实应该看全部:

grep -v "^#" tmp.vcf|cut -f 1-8|head
chr1    68941   .   A   <NON_REF>   .   .   END=69071
chr1    69072   .   G   <NON_REF>   .   .   END=69072
chr1    69073   .   A   <NON_REF>   .   .   END=69082
chr1    69083   .   A   <NON_REF>   .   .   END=69089
chr1    69090   .   T   <NON_REF>   .   .   END=69100
chr1    69101   .   A   <NON_REF>   .   .   END=69129
chr1    69130   .   C   <NON_REF>   .   .   END=69158
chr1    69159   .   G   <NON_REF>   .   .   END=69183
chr1    69184   .   G   <NON_REF>   .   .   END=69184
chr1    69185   .   T   <NON_REF>   .   .   END=69198

折叠的非常巧妙。

具体GATK的gvcf可以阅读:https://software.broadinstitute.org/gatk/documentation/article.php?id=3893

https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf

我不想赚你的钱,不行吗? (推荐阅读)

您可能也对以下帖子感兴趣

文章有问题?点此查看未经处理的缓存